
Conversation

@kfirwolfson

@kfirwolfson kfirwolfson commented Sep 9, 2025

[V1][Core] Add a cache hit threshold for requests

Purpose

Introduce an optional KV-cache hit-rate gating mechanism, discussed in RFC #24256, to skip requests whose prefill is unlikely to benefit from cached KV in P/D disaggregated deployments.

Edit: an additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up. The main problem is that the external router orchestrating P/D (such as llm-d, Dynamo, or Production Stack) has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) rejects this prefill work in case of preemption, and the request is sent back to the calling router / sidecar / worker.

What this PR adds

  • Global setting: --global-cache-hit-threshold ([0.0–1.0], default 0.0)
  • Per-request override: cache_hit_threshold ([0.0–1.0]) on incoming ChatCompletionRequest / CompletionRequest requests (validated in the protocol layer).
  • Finish reason: New enum value and string "cache_threshold" exposed via v1 engine API. Requests rejected by this gating return HTTP 200 with finish_reason="cache_threshold" and no output tokens.
  • Config visibility & hashing: Threshold is included in VllmConfig and SchedulerConfig.
  • Bounds & validation: All threshold values are validated to the range [0.0, 1.0]; an illustrative validation sketch follows this list.
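
For illustration, bounds validation of this kind could look roughly as follows. This is a sketch using a pydantic-style constrained field; the class name and field subset are illustrative, not the exact vLLM protocol code.

from typing import Optional
from pydantic import BaseModel, Field

class CompletionRequestSketch(BaseModel):
    # Illustrative subset of a completion request; only the new field matters here.
    prompt: str
    max_tokens: int = 16
    # None means "fall back to the global --global-cache-hit-threshold value".
    cache_hit_threshold: Optional[float] = Field(default=None, ge=0.0, le=1.0)

With such a constraint, out-of-range values like -0.1 or 1.5 are rejected at request-parsing time, while the boundary values 0.0 and 1.0 are accepted.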

Why

  • Enables a decode-first optimization in P/D disaggregation: when the ratio of computed tokens (local + external) to prompt length is below the threshold, we avoid scheduling low-benefit prefills on decode nodes. This reduces wasted work and remote KV transfers when cache reuse is insufficient (see the sketch below).
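
As a rough sketch of the gating decision (illustrative names only, not the actual scheduler code), with the per-request value taking precedence over the global one:

def should_reject_for_low_cache_hit(
    num_computed_tokens: int,      # local + external KV-cache hits, in tokens
    num_prompt_tokens: int,
    request_threshold: float | None,
    global_threshold: float,
) -> bool:
    """Return True if the request should finish with reason 'cache_threshold'."""
    threshold = request_threshold if request_threshold is not None else global_threshold
    if threshold <= 0.0 or num_prompt_tokens == 0:
        # Gating disabled (default 0.0), or nothing to prefill.
        return False
    hit_ratio = num_computed_tokens / num_prompt_tokens
    return hit_ratio < threshold

For example, 16 cached tokens against a 58-token prompt give a ratio of about 0.28, which is rejected under a 0.33 threshold.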

Backwards compatibility

  • Default is 0.0 → feature is disabled by default. No behavior change unless the threshold is set globally or per request.

Test Plan

1) Unit Tests

Unit tests check the scheduler logic, including:

  • the request threshold overriding the global threshold (a simplified sketch of this rule appears after the list)
  • cache hits from the local KV cache, the external KV cache, or both
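
As a simplified, self-contained sketch of the first case (pytest style, using an illustrative helper rather than vLLM's scheduler internals):

# Illustrative precedence rule: a per-request threshold, when present,
# overrides the global one; otherwise the global value applies.
def effective_threshold(request_threshold, global_threshold):
    return request_threshold if request_threshold is not None else global_threshold

def test_request_threshold_overrides_global():
    assert effective_threshold(0.33, 0.8) == 0.33

def test_missing_request_threshold_falls_back_to_global():
    assert effective_threshold(None, 0.8) == 0.8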

2) E2E manual tests

Run vllm serve with the --global-cache-hit-threshold 0.8 argument to set a default value. We'll override it in most requests.

vllm serve <model_path> --served-model-name "Llama-3.1-8B-Instruct" --global-cache-hit-threshold 0.8

The scheduler computes hit_ratio = computed_tokens / prompt_tokens.

We will send 4 requests. Note that the order matters, as the first request fills the cache the others depend on:

  • Request 1 is sent with cache_hit_threshold: 0, so it is guaranteed to execute and populate the KV-cache. It is short (26 tokens) and will be the prefix of the following requests.
  • Requests 2 and 3 are sent with cache_hit_threshold: 0.33:
    • Request 2: long prompt ≈ 58 tokens → ratio 16/58 ≈ 0.28 → rejected, as the ratio is below the threshold
    • Request 3: medium prompt ≈ 40 tokens → ratio 16/40 ≈ 0.4 → normal generation
  • Request 4 is sent without a cache_hit_threshold field, so the global value of 0.8 takes effect: medium prompt ≈ 39 tokens → ratio 16/39 ≈ 0.41 → rejected, as the ratio is below the global threshold

Request 1) Warm the cache

This run uses cache_hit_threshold: 0 so it’s guaranteed to execute and populate the KV-cache for the base segment.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size",
    "max_tokens": 20,
    "cache_hit_threshold": 0
  }'

Request 2) MISS case

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size. Then we continue with many words so that the token length will exceed 16*3 and cache hit rate will be too low to pass the test case threshold",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"
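
For reference, the rejected response body is a normal OpenAI-style completion with empty output text and the new finish reason, roughly of this shape (fields abbreviated, values illustrative; the exact payload may differ):

{
  "id": "cmpl-...",
  "object": "text_completion",
  "model": "Llama-3.1-8B-Instruct",
  "choices": [
    { "index": 0, "text": "", "finish_reason": "cache_threshold" }
  ]
}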


Request 3) HIT case

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix but continue with with whatever text tokens we like and keep it medium after all",
    "max_tokens": 20,
    "cache_hit_threshold": 0.33
  }'

Expected: normal generation ("finish_reason" is not "cache_threshold").

Request 4) MISS case using global threshold

This request omits cache_hit_threshold, so the global threshold of 0.8 applies.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.1-8B-Instruct",
    "prompt": "This is the beginning of a long prompt with many tokens, we need a min of 16 to be the shared prefix and now continue with different text so the hit rate will be too low",
    "max_tokens": 20
  }'

Expected: HTTP 200 with "finish_reason": "cache_threshold"

Notes

  • Exact token counts can vary slightly by tokenizer/model; we got the numbers above using Llama-3.1-8B-Instruct. A quick way to check counts for your own prompts is sketched below.
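
The prompt length in tokens can be measured directly with the model's tokenizer; a sketch (assuming the Hugging Face tokenizer for the served checkpoint is available locally; the model ID below is illustrative):

from transformers import AutoTokenizer

# Illustrative checkpoint; use whatever model you actually serve.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
prompt = "This is the beginning of a long prompt with many tokens, we need a min of 16 to fill the default block size"
print(len(tokenizer(prompt).input_ids))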

Test Result

E2E Local smoke tests on a single node:

  • Below threshold: responses returned 200 with finish_reason: "cache_threshold" and empty outputs.
    • Validated with debug logs
    • Request threshold:
      • Request cmpl-410004b615a54d73b7e9f0deebf2b852-0 rejected: cache hit rate 0.28 < threshold 0.33 (request)
    • Global threshold:
      • Request cmpl-6d66ba796f9247fcadca54ae428bf790-0 rejected: cache hit rate 0.41 < threshold 0.80 (global)
  • At/above threshold: normal token generation.
  • Validators rejected out-of-range values and accepted the boundary values 0.0 and 1.0 (not detailed above).

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a cache hit threshold to gate requests, which is a useful optimization for disaggregated deployments. The implementation is mostly solid, covering configuration, API exposure, and the core scheduling logic.

I've identified a critical issue that could lead to a ZeroDivisionError in the scheduler when processing requests with empty prompts. Additionally, there's a code duplication issue in the protocol validation that should be addressed to improve maintainability. My detailed comments provide suggestions for fixing these issues.

@github-actions

github-actions bot commented Sep 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 3425995 to 7c0485e on September 9, 2025 16:31
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0b75346 to 8be6b61 on September 14, 2025 05:58
@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat self tag

@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 8be6b61 to 0400566 on September 30, 2025 10:24
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 4d756b7 to 0c15acc on September 30, 2025 12:59
@mergify

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 3, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0c15acc to 0c9cb3f on October 6, 2025 06:06
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 0c9cb3f to c087238 on October 6, 2025 06:38
@mergify mergify bot removed the needs-rebase label Oct 6, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch 2 times, most recently from 06abf34 to eeae693 on October 6, 2025 22:53
@kfirwolfson
Author

(also added to the PR description above)

An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up. The main problem is that the external router orchestrating P/D (such as llm-d, Dynamo, or Production Stack) has no control over this vLLM behavior once the Decode instance has received the request. Setting a small cache hit-rate threshold on the request (say 0.001) rejects this prefill work in case of preemption, and the request is sent back to the calling router / sidecar / worker.

@mergify

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kfirwolfson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 908c9fd to e391053 on October 16, 2025 09:37
@mergify mergify bot added ci/build and removed needs-rebase labels Oct 16, 2025
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch 3 times, most recently from b2553c7 to 97698ba on October 16, 2025 10:20
@markmc
Member

markmc commented Oct 16, 2025

An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up.

xref #26813 - a proposal to add a policy that if a request fails because remote KV can't be loaded, we just abort the request rather than falling back to doing the prefill work in the decode instance

@kfirwolfson
Author

An additional useful scenario for this capability is request preemption on a Decode instance in P/D disaggregated deployments. The scenario manifested in llm-d P/D tests and involved P/D requests that get preempted on the Decode instance: today vLLM simply evicts such requests' KV-cache blocks and later retries the requests from scratch. This means the full prefill work is done internally inside the Decode instance, including for all the new (possibly many) output tokens. Tests in the field showed this case leads to decoders starting to execute prefills and eventually locking up.

xref #26813 - a proposal to add a policy that if a request fails because remote KV can't be loaded, we just abort the request rather than falling back to doing the prefill work in the decode instance

Good catch, @markmc. I'll comment there - we can possibly join forces. Are you reviewing this PR as well?

@elevran

elevran commented Oct 19, 2025

@kfirwolfson
nit: might it be clearer to the caller if the return code is not 200/success?
As currently defined, it requires inspecting the payload before the request is rerouted/retried elsewhere. Error codes 429 (Too Many Requests) or 503 (Service Unavailable) along with a custom header might be friendlier to the routing layer / sidecar.

@kfirwolfson
Author

kfirwolfson commented Oct 19, 2025

@elevran it's a good question what code to return. In the preemption use-case we can gain from a 200 response by attaching the output tokens. Please see the "Optimization (phase 2)" section under RFC #24256. The alternative mentioned there is 422.

@orozery
Contributor

orozery commented Oct 27, 2025

@kfirwolfson looks very good to me! Thanks!

Kfir Wolfson added 3 commits November 3, 2025 17:43
Fix Gemini CR comments
Add unit tests
Move from SamplingParams to request
unit test remake
fix static code analysis rejects
Fix unit test
fix after local CR
fix pre-commit reject
add threshold to request logger and fix some calls to encode
fix ruff

Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
Signed-off-by: Kfir Wolfson <kfirw@pliops.com>
@kfirwolfson kfirwolfson force-pushed the feature/kv-cache-hit-threshold branch from 97698ba to da39332 on November 4, 2025 07:28